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Abstract 

In this paper, we consider a class of continuous-time, continuous-space stochastic optimal 
control problems. Building upon recent advances in Markov chain approximation methods and 
sampling-based algorithms for deterministic path planning, we propose a novel algorithm called 
the incremental Markov Decision Process (iMDP) to compute incrementally control policies that 
approximate arbitrarily well an optimal policy in terms of the expected cost. The main idea 
behind the algorithm is to generate a sequence of finite discretizations of the original problem 
through random sampling of the state space. At each iteration, the discretized problem is a 
Markov Decision Process that serves as an incrementally refined model of the original problem. 
We show that with probability one, (i) the sequence of the optimal value functions for each of the 
discretized problems converges uniformly to the optimal value function of the original stochastic 
optimal control problem, and (ii) the original optimal value function can be computed efficiently 
O in an incremental manner using asynchronous value iterations. Thus, the proposed algorithm 

provides an anytime approach to the computation of optimal control policies of the continuous 
^-H problem. The effectiveness of the proposed approach is demonstrated on motion planning and 

^ control problems in cluttered environments in the presence of process noise. 

1 Introduction 

Stochastic optimal control has been an active research area for several decades with many applica- 
tions in diverse fields ranging from finance, management science and economics [U [2] to biology [3] 
I and robotics [4J. Unfortunately, general continuous-time, continuous-space stochastic optimal con- 

^ trol problems do not admit closed-form or exact algorithmic solutions and are known to be compu- 

tationally challenging [5j . Many algorithms are available to compute approximate solutions of such 
problems. For instance, a popular approach is based on the numerical solution of the associated 

^ Hamilton- Jacobi-Bellman PDE (see, e.g., [UEKH]). Other methods approximate a continuous prob- 

lem with a discrete Markov Decision Process (MDP), for which an exact solution can be computed 
in finite time [9l[T0]. However, the complexity of these two classes of deterministic algorithms scales 
exponentially with the dimension of the state and control spaces, due to discretization. Remarkably, 
algorithms based on random (or quasi-random) sampling of the state space provide a possibility 
to alleviate the curse of dimensionality in the case in which the control inputs take values from a 
finite set, as noted in [Til 1121 E]. 

Algorithms based on random sampling of the state space have recently been shown to be very 
effective, both in theory and in practice, for computing solutions to deterministic path planning 
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problems in robotics and other disciplines. For example, the Probabilistic RoadMap (PRM) algo- 
rithm first proposed by Kavraki et al. [TSj was the first practical planning algorithm that could han- 
dle high-dimensional path planning problems. Their incremental counterparts, such as RRT |14j . 
later emerged as sampling-based algorithms suited for online applications and systems with differ- 
ential constraints on the solution (e.g., dynamical systems). The RRT algorithm has been used in 
many applications and demonstrated on various robotic platforms I16j . Recently, optimality 
properties of such algorithms were analyzed in [T7j. In particular, it was shown that the RRT algo- 
rithm fails to converge to optimal solutions with probability one. The authors have proposed the 
RRT* algorithm which guarantees almost-sure convergence to globally optimal solutions without 
any substantial computational overhead when compared to the RRT. 

Although the RRT* algorithm is asymptotically optimal and computationally efficient (with 
respect to RRT), it can not handle problems involving systems with uncertain dynamics. In this 
work, building upon the Markov chain approximation method |18j and the rapidly-exploring sam- 
pling technique [T^ , we introduce a novel algorithm called the incremental Markov Decision Process 
(iMDP) to approximately solve a wide class of stochastic optimal control problems. More precisely, 
we consider a continuous-time optimal control problem with continuous state and control spaces, 
full state information, and stochastic process noise. In iMDP, we iteratively construct a sequence of 
discrete Markov Decision Processes (MDPs) as discrete approximations to the original continuous 
problem, as follows. Initially, an empty MDP model is created. At each iteration, the discrete 
MDP is refined by adding new states sampled from the boundary as well as from the interior of 
the state space. Subsequently, new stochastic transitions are constructed to connect the new states 
to those already in the model. For the sake of efficiency, stochastic transitions are computed only 
when needed. Then, an anytime policy for the refined model is computed using an incremental 
value iteration algorithm, based on the value function of the previous model. The policy for the 
discrete system is finally converted to a policy for the original continuous problem. This process is 
iterated until convergence. 

Our work is mostly related to the Stochastic Motion Roadmap (SMR) algorithm |19J and Markov 
chain approximation methods [T5] . The SMR algorithm constructs an MDP over a sampling-based 
roadmap representation to maximize the probability of reaching a given goal region. However, in 
SMR, actions are discretized, and the algorithm does not offer any formal optimality guarantees. On 
the other hand, while available Markov chain approximation methods [18j provide formal optimality 
guarantees under very general conditions, a sequence of a priori discretizations of state and control 
spaces still impose expensive computation. The iMDP algorithm addresses this issue by sampling 
in the state space and sampling or discovering necessary controls. 

The main contribution of this paper is a method to incrementally refine a discrete model of 
the original continuous problem in a way that ensures convergence to optimality while maintaining 
low time and space complexity. We show that with probability one, the sequence of optimal 
value functions induced by optimal control policies for each of the discretized problems converges 
uniformly to the optimal value function of the original stochastic control problem. In addition, 
the optimal value function of the original problem can be computed efficiently in an incremental 
manner using asynchronous value iterations. Thus, the proposed algorithm provides an anytime 
approach to the computation of optimal control policies of the continuous problem. Distributions of 
approximating trajectories and control processes returned from the iMDP algorithm approximate 
arbitrarily well distributions of optimal trajectories and optimal control processes of the original 
problem. Each iteration of the iMDP algorithm can be implemented with the time complexity 
0{k^ log k) where < 9 < 1 while the space complexity is 0{k), where k is the number of states in 
an MDP model in the algorithm which increases linearly due to the sampling strategy. Thus, the 
entire processing time until the algorithm stops can be implemented in 0{k^^^ logk). Hence, the 
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above space and time complexities make iMDP a practical incremental algorithm. The effectiveness 
of the proposed approach is demonstrated on motion planning and control problems in cluttered 
environments in the presence of process noise. 

This paper is organized as follows. In Section [2| a formal problem definition is given. The 
Markov chain approximation methods and the iMDP algorithm are described in Sections [3] and |4j 
The analysis of the iMDP algorithm is presented in Section [5] Section [6] is devoted to simulation 
examples and experimental results. The paper is concluded with remarks in Section [7| We provide 
additional notations and preliminary results as well as proofs for theorems and lemmas in Appendix. 

2 Problem Definition 

In this section, we present a generic stochastic optimal control problem. Subsequently, we discuss 
how the formulation extends the standard motion planning problem of reaching a goal region while 
avoiding collision with obstacles. 

Stochastic Dynamics Let d^, du, and dw be positive integers. The ^^.-dimensional and du- 
dimensional Euclidean spaces are M'^"' and M*^" respectively. Let be a compact subset of M'^'", 
which is the closure of its interior S° and has a smooth boundary dS. The state of the system at 
time t is x{t) G S, which is fully observable at all times. We also define a compact subset U of M'^" 
as a control set. 

Suppose that a stochastic process {'w{t);t > 0} is a d^-dimensional Brownian motion, also 
called a Wiener process, on some probability space {Q.,F,V). Let a control process {u{t);t > 0} 
be a [/-valued, measurable process also defined on the same probability space. We say that the 
control process u{-) is nonanticipative with respect to the Wiener process w(-) if there exists a 
filtration > 0} defined on {Vl^F^V) such that u{-) is -F^-adapted, and w{-) is an J^f-Wiener 

process. In this case, we say that n(-) is an admissible control inputs with respect to or the 
pair {u{-),w{-)) is admissible. Let W^^^'^^' denote the set of all dx by dw real matrices. We consider 
stochastic dynamical systems, also called controlled diffusions, of the form 

dx{t) = f{x{t),u{t))dt + F{x{t),u{t))dw{t), yt>0 (!) 

where f : S xU — t- M'^^ and F : S xU ^ ]^dxxd^ ^^^^ bounded measurable and continuous functions 
as long as x{t) G S°. The matrix F(-, ■) is assumed to have full rank. More precisely, a solution to 
the differential form given in Eq. ([T]) is a stochastic process {x{t);t > 0} such that x{t) equals the 
following stochastic integral in all sample paths: 

X{t)=x{0)+ [ f{x{T),u{T))dT+ [ F{x{T),u{T))dw{T), (2) 

Jo Jo 

until x(-) exits 5"°, where the last term on the right hand side is the usual Ito integral (see, e.g., |20j). 
When the process x(-) hits dS, the process x(-) is stopped. 

Weak Existence and Weak Uniqueness of Solutions Let T be the sample path space of 
admissible pairs {u{-),w{-)). Suppose we are given probability measures A and Pq on F and on 
S respectively. We say that solutions of ^ exist in the weak sense if there exists a probability 
space {Q,F,V), a filtration {Ft;t > 0}, an J^t-Wiener process w(-), an J^^-adapted control process 
ti(-), and an /"^-adapted process x(-) satisfying Eq. ([2]), such that A and Pq are the distributions 
of {u{-),w{-)) and x(0) under V. We call such tuple {{Q,,F,V),Ft,w{-),u{-),x{-)} a weak sense 
solution of Eq. ^IH]. 
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Assume that we are given weak sense solutions {{Qi, Ti,Vi) , Tt,!, 'Wi{-), Ui{-) , Xi{-)} , i = 1,2, to 
Eq. ([l]). We say solutions are weakly unique if equality of the joint distributions of {wi{-),Ui{-),Xi{0)) 
under Vi, i = 1,2, implies the equality of the distributions {xi{-),Wi{-),Ui{-), Xi{0)) under Vi, 
i = 1,2 [211118]. 

In this paper, given the boundedness of the set S, and the definition of the functions / and F in 
Eq. ([T]) , we have a weak solution to Eq. ([T]) that is unique in the weak sense [21| . The boundedness 
requirement is naturally satisfied in many applications and is also needed for the implementation of 
the proposed numerical method. We will also handle the case in which / and F are discontinuous 
with extra mild technical assumptions to ensure asymptotic optimality in Section |3j 

Policy and Cost-to-go Function A particular class of admissible controls, called Markov con- 
trols, depends only on the current state, i.e., u(t) is a function only of x{t), for all t >0. It is well 
known that in control problems with full state information, the best Markov control performs as 
well as the best admissible control (see, e.g., [20l[21]). A Markov control defined on 5 is also called 
a policy, and is represented by the function fi : S ^ U. The set of all policies is denoted by 11. 
Define the first exit time : 11 — )• [0, +oo] under policy /i as 

Tf, = inf {t : x{t) ^ S" and Eq. ^ and u{t) = fi{x{t))}. 

Intuitively, is the first time that the trajectory of the dynamical system given by Eq. ^ with 
u{t) = fj.{x{t)) hits the boundary dS of S. By definition, = +oo if x(-) never exits S°. Clearly, 
Tfj_ is a random variable. Then, the expected cost-to-go function under policy is a mapping from 
5 to M defined as 



a 







'g{x{t),fi{x{t))) dt + h{x{T^)) I x(0) = z 



where g : S xU and h : S ^M. are bounded measurable and continuous functions, called the 
cost rate function and the terminal cost function, respectively, and a G [0, 1) is the discount rate. 
We further assume that g{x, u) is uniformly Holder continuous in x with exponent 2p € (0, 1] for 
all u £ U. That is, there exists some constant C > such that 

\g{x,u) — g{x',u)\ < C\\x — x'Hg'', Vx,x' G S. 

We will address the discontinuity of g and h in Section [3) 

The optimal cost-to-go function J* : 5 — )• M is defined as J*{z) = inf^gn Jiii^) for all z £ S. A 
policy fj,* is called optimal if J^* = J*. For any e > 0, a policy fj, is called an e-optimal policy if 
II / _ <: f 

W'^ fi I |oo ^ 

In this paper, we consider the problem of computing the optimal cost-to-go function J* and 
an optimal policy /u* if obtainable. Our approach, outlined in Section |4| approximates the optimal 
cost-to-go function and an optimal policy in an anytime fashion using incremental sampling-based 
algorithms. This sequence of approximations is guaranteed to converge uniformly to the optimal 
cost-to-go function and to find an e-optimal policy for an arbitrarily small non-negative e, almost 
surely, as the number of samples approaches infinity. 



Relationship with Standard Motion Planning The standard motion planning problem of 
finding a collision- free trajectory that reaches a goal region for a deterministic dynamical system 
can be defined as follows (see, e.g., [17J). Let X C M°'^ be a compact set. Let the open sets 
Afobs and denote the obstacle region and the goal region, respectively. Define the obstacle- 

free space as Xir^e := \ Afobs- Let Xinit S '^free- Consider the deterministic dynamical system 
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X = f{x{t),u{t))dt, where f : X x U — )• M'^^. The feasible motion planning problem is to find a 
measurable control input it : [0, T] — )• f/ such that the resulting trajectory x[t) is collision free , i.e., 
x{t) G Affree and reaches the goal region, i.e., x{T) £ Xgoai- The optimal motion planning problem is 
to find a measurable control input u such that the resulting trajectory x solves the feasible motion 
planning problem with minium trajectory cost. 

The problem considered in this paper extends the classical motion planning problem with 
stochastic dynamics as described by Eq. ([T]). Given a goal set <^goai and an obstacle set Afobs, 
define S := X \ (Afgoai U ^Vohs) and thus dXg^ai U dX^hs U dX = dS. Due to the nature of Brownian 
motion, under most policies, there is some non-zero probability that collision with an obstacle set 
will occur. However, to penalize collision with obstacles in the control design process, the cost 
of terminating by hitting the obstacle set, i.e., h{z) for z £ 5Afobs; can be made arbitrarily high. 
Clearly, the higher this number is, the more conservative the resulting policy will be. Similarly, 
the terminal cost function on the goal set, i.e., h{z) for z £ SAfgoai, can be set to a small value to 
encourage terminating by hitting the goal region. 



3 Markov Chain Approximation 

A discrete-state Markov decision process (MDP) is a tuple M = {X, A, P, G, H) where X is a finite 
set of states, ^ is a set of actions that is possibly a continuous space, P{- | •, •) : X x X x A ^ M>o 
is a function that denotes the transition probabilities satisfying Yl^'ex ^i^' 1^^''^) — ^ ^r all £ X 
and all f £ A, G(-, •) : X x j4 — )• M is an immediate cost function, and H : X is a terminal cost 
funtion. If we start at time with a state .^o £ X, and at time z > 0, we apply an action Vi £ A at 
a state to arrive at a next state ^j+i according to the transition probability function P, we have 
a controlled Markov chain {^j; z G N}. The chain {^f, i £N} due to the control sequence {vi;i £ N} 
and an initial state will also be called the trajectory of A4 under the said sequence of controls 
and initial state. 

Given a continuous-time dynamical system as described in Eq. ([T]) , the Markov chain approxima- 
tion method approximates the continuous stochastic dynamics using a sequence of MDPs {Ain}'^=o 
in which Ain = {Sn, U, Pn, Gn, Hn) where Sn is a discrete subset of S, and U is the original control 
set. We define dSn = dSCiSn- For each n £ N, let {^"; z G N} be a controlled Markov chain on Ain 
until it hits dSn- We associate with each state z in S* a non- negative interpolation interval At„(z), 
known as a holding time. We define tf = ^lo"^ ^tn{^?) for i > 1 and t[f = 0. Let A^f = ^'Y^ - . 
Let u'^ denote the control used at step i for the controlled Markov chain. In addition, we define 
Gn{z, v) = g{z, v)Atn{z) and Hn{z) = h{z) for each z £ Sn and v £ U. Let Qn be the sample space 
of A4n- Holding times At„ and transition probabilities Pn are chosen to satisfy the local consistency 
property given by the following conditions: 

1. For all z£ S, 

lim Atn{z) = 0, (3) 

n— >oo 

2. For all z G 5 and all v £U: 

f{z,v), (4) 

F{z,v)F{z,vf, (5) 
0. (6) 



hm ' 



lim 



Atniz) 

CovpJAgf |g = ^,< = ^] 

Atn{z) 

lim sup ||A4"||2 



5 



The chain {^";^ G N} is a discrete-time process. In order to approximate the continuous- 
time process x(-) in Eq. Q, we use an approximate continuous-time interpolation. We define the 
(random) continuous-time interpolation ^"(•) of the chain {i^;i G N} and the continuous-time 
interpolation u^{-) of the control sequence {u^',i G N} under the holding times function At„ as 
follows: = ii, and n"(r) = < for all r G [t^t^+i). Let Z)'^-[0,+oo) denote the set of all 

M"^^ -valued functions that are continuous from the left and has limits from the right. The process 
^'^ can be thought of as a random mapping from i7„ to the function space Z)'^^[0, +cxd). 

A control problem for the MDP is analogous to that defined in Section [2j Similar to 
previous section, a policy /i^ is a function that maps each state z G 5„ to a control ^n{z) G U . The 
set of all such policies is n„. Given a policy the (discounted) cost-to-go due to /i„ is: 



. 1=0 



where Ep^ denotes the conditional expectation under Pn, the sequence {^f;i G N} is the controlled 
Markov chain under the policy and /„ is termination time defined as In = min{z : G dSn}- 
The optimal cost function, denoted by J* satisfies 

J*{z)= inf J„,^„(2:), \/zeSn. (7) 

An optimal policy, denoted by ^* , satisfies Jn,^* (-z) = JjJ(2:) for all z G 5„. For any e > 0, fin is an 
e-optimal policy if \ \Jn,fi„ - JnWoo < £• 

As stated in the following theorem, under mild technical assumptions, local consistency implies 
the convergence of continuous-time interpolations of the trajectories of the controlled Markov chain 
to the trajectories of the stochastic dynamical system described by Eq. ([T]). 

Theorem 1 (see Theorem 10.4.1 in [18j) Let us assume that f ■) andF{-,-) are measurable, 
bounded and continuous. Thus, Eq. Q has a weakly unique solution. Let {7W„}^q be a sequence of 
MDPs, and {At„}^g be a sequence of holding times that are locally consistent with the stochastic 
dynamical system described by Eq. ([T]). Let G N} 6e a sequence of controls defined for 

each n G N. For all n G N, let {i^{t)]t G M>o} denote the continuous-time interpolation to the 
chain G N} under the control sequence {u^]i G N} starting from an initial state Zinit, and 

{u"'{t)]t G M>o} denote the continuous-time interpolation of{uf;i G N}, according to the holding 
time Atn. Then, any subsequence o/ {(^"(•), m"(-))}^o ^'^■^ a further subsequence that converges 
in distribution to (x(-),?i(-)) satisfying 

x{t) = Zinit+ [ f{x{T),u{T))dT+ I F{x{T),u{T))dw{T). 

Jo Jo 

Under the weak uniqueness condition for solutions of Eq. ([T]), the sequence {{^"' {■) , u"' {■))}'^=q also 
converges to (x(-),u(-)). 

Furthermore, a sequence of minimizing controls guarantees pointwise convergence of the cost func- 
tion to the original optimal cost function in the following sense. 

Theorem 2 (see Theorem 10.5.2 in [18) ^ Assume that /(•,•), F{-,-), g{-,-) and h{-) are mea- 
surable, bounded and continuous. For any trajectory x{-) of the system described by Eq. ([T|, define 
f{x) := inf{t : x{t) ^ S°}. Let {Mn = {Sn,U, Pn, Gn, Hn)}'^=Q and {Atn}^=o locally consistent 
with the system described by Eq. ([T]) . 
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We suppose that the function f(-) is continuous (as a mapping from D'^^[0, +00) to the com- 
pactified interval [0, +oo]J with probability one relative to the measure induced by any solution to 
Eq. ([1]) for an initial state z, which is satisfied when the matrix F{-, ■)F{-, ■)'^ is nondegenerate. 
Then, for any z G Sn, the following equation holds: 

lim |J*(z) - r{z)\ = 0. 

n— ^-oo 

In particular, for any z G Sn, for any sequence {e„ > 0}^q such that lim„_5.oo £■« = 0, and for any 
sequence of policies {/x„}^q such that fjsfi is CLTi Cfi-optifficil policy of .AAyi? have.' 

lim |J„,^„(z)- J*(z)| =0. 
n— >-oo 

Moreover, the sequence {tf^; n G N} converges in distribution to the termination time of the optimal 
control problem for the system in Eq. ([T]) when the system is under optimal control processes. 

Under the assumption that the cost rate g is Holder continuous [22] with exponent 2p, the sequence 
of optimal value functions for approximating chains J* indeed converges uniformly to J* with a 
proven rate. Let us denote = sup2g5^ b{x) as the sup- norm over 5„ of a function b with 

domain containing Sn- Let 

Cn = max min \\z' — z\\2 (8) 

be the dispersion of Sn- 

Theorem 3 (see Theorem 2.3 in [23j and Theorem 2.1 in [24j ) Consider an MDP sequence 
{TWn = {Sn-, U, Pn, Gn, Hn)}'i^=o ^'"''^ holding times {At„}^Q that are locally consistent with the sys- 
tem described by Eq. ([T]). Let J* be the optimal cost of Ain- Given the assumptions on the dynamics 
and cost rate functions in Section^ as n approaches 00, we have 

\\J*n-J*\h.=0{C)- 



Discontinuity of dynamics and objective functions 

We note that the above theorems continue to hold even when the functions f,F,g, and h are 
discontinuous. In this case, the following conditions are sufficient to use the theorems: (i) For r to 
be /, F, g, or h, r{x, u) takes either the form rQ[x) +ri{u) or rQ{x)ri[u) where the control dependent 
terms are continuous and the x-dependent terms are measurable, and (ii) /(x, •), F{x, •),g{x, •), and 
h{x) are nondegenerate for each x, and the set of discontinuity in x of each function is a uniformly 
smooth surface of lower dimension. Furthermore, instead of uniform Holder continuity, the cost 
rate g can be relaxed to be locally Holder continuous with exponent 2p on S (see, e.g., page 275 
inHH). 

Let us remark that the controlled Markov chain differs from the stochastic dynamical systems 
described in Section [2] in that the former possesses a discrete state structure and evolves in a 
discrete time manner while the latter is a continuous model both in terms of its state space and 
the evolution of time. Yet, both models possess a continuous control space. It will be clear in the 
following discussion that the control space does not have to be discretized if a certain optimization 
problem can be solved numerically or via sampling. 

The above theorems assert the asymptotic optimality given a sequence of a priori discretizations 
of the state space and the availability of e-optimal policies. In what follows, we describe an algorithm 
that incrementally computes the optimal cost-to-go function and an optimal control policy of the 
continuous problem. 
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4 The iMDP Algorithm 



Based on Markov chain approximation results, the iMDP algorithm incrementally builds a sequence 
of discrete MDPs with probability transitions and cost-to-go functions that consistently approxi- 
mate the original continuous counterparts. The algorithm refines the discrete models by using a 
number of primitive procedures to add new states into the current approximate model. Finally, 
the algorithm improves the quality of discrete-model policies in an iterative manner by effectively 
using the computations inherited from the previous iterations. Before presenting the algorithm, 
some primitive procedures which the algorithm relies on are presented in this section. 



4.1 Primitive Procedures 
4.1.1 Sampling 

The SampleO and SampleBoundary() procedures sample states independently and uniformly from 
the interior 5"° and the boundary dS, respectively. 



4.1.2 Nearest Neighbors 

Given z £ S and a set y C 5 of states. For any /c G N, the procedure Nearest (z, Y, k) returns the 
k nearest states z' that are closest to z in terms of the Euclidean norm. 



4.1.3 Time Intervals 

Given a state z E 5 and a number A; G N, the procedure ComputeHoldingTiine(2;, /;:) returns a 
holding time computed as follows: 

, (logkV'P'''^ 
ComputeHoldingTime(z, Kj = 7t I — — 1 , 

where 7t > is a constant, and ^, 6 are constants in (0, 1) and (0, 1] respectivel}|^ The parameter 
p G (0,0.5] defines the Holder continuity of the cost rate function g{-,-) as in Section [2j 



4.1.4 Transition Probabilities 

Given a state z G S*, a subset y G S*, a control v €z U, and a positive number r describing a 
holding time, the procedure ComputeTranProb(z, r, y) returns (i) a finite set Z^ear C 5 of states 
such that the state z + f{z,v)T belongs to the convex hull of Z^^ar and \\z' — z\\2 = 0{t) for all 
z' ^ z £ Z^car, and (ii) a function p that maps ^near to a non- negative real numbers such that 
is a probability distribution over the support .^near- It is crucial to ensure that these transition 
probabilities result in a sequence of locally consistent chains in the algorithm. 

There are several ways to construct such transition probabilities. One possible construction 
by solving a system of linear equations can be found in [18]. In particular, we choose Z^ear = 
Nearest(z + f{z,v)T,Y,s) where s G N is some constant. We define the transition probabilities 
p : Zjicar — ^ IR>o that satisfies: 

(i) -z) = f{z,v)T + o(r), 

(ii) - - zf = F{z,v)F{z,vfr + f{z,v)fiz,vfr' + o(r). 
^Typical values of ? is [0.999,1). 
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(iii) E.'ez„_P(^0 = 1- 

An alternate way to compute the transition probabilities is to approximate using local Gaussian 
distributions. We choose Zncar = Nearest(2; + f{z,v)T,Y,s) where s = 0(log(|y|)). Let J\fm,a[-) 
denote the density of the (possibly multivariate) Gaussian distribution with mean fn and variance 
a. Define the transition probabilities as follows: 

p{z ) - 



where fn = z + f{z, v)t and a = F{z, v)F{z, v)'^t. This expression can be evaluated easily for any 
fixed V £ U. As \Znear\ approaches infinity, the above construction satisfies the local consistency 
almost surely. 



As we will discuss in Section 4.2 , the size of the support Znear affects the complexity of the iMDP 
algorithm. We note that solving a system of linear equations requires computing and handling a 
matrix of size (d^ + dx + 1) x |Znear| where |^near| is constant. When dx and |.^near| are large, the 
constant factor of the complexity is large. In contrast, computing local Gaussian approximation 
requires only linear | evaluations. Thus, although local Gaussian approximation yields higher time 
complexity, this approximation is more convenient to compute. 



4.1.5 Backward Extension 

Given T > and two states z,z' G S, the procedure ExtendBackwards(z, z', T) returns a triple 
{x,v,t) such that (i) x{t) = f{x{t),u{t))dt and u{t) = v £ U ior all t £ [0,t], (ii) t <T, (iii) 
x{t) G S for all t G [0, r], (iv) x(r) = z, and (v) x(0) is close to z' . If no such trajectory exists, then 
the procedure returns failur^ We can solve for the triple (x, f , r) by sampling several controls v 
and choose the control resulting in a;(0) that is closest to z' . 



4.1.6 Sampling and Discovering Controls 

The procedure ConstructControls(/c, z, y, T) returns a set of k controls in U. We can uniformly 
sample k controls in U. Alternatively, for each state z' G Nearest (z, y, A;), we solve for a control 
V £U such that (i) x{t) = f{x{t),u{t))dt and u{t) = v £ U ioi all t £ [0,T], (ii) x{t) £ S for all 
t £ [0, T], (iii) x(0) = z and x{T) = z'. 



4.2 Algorithm Description 

The iMDP algorithm is given in Algorithm [TJ The algorithm incrementally refines a sequence of 
(finite-state) MDPs Ain = {Sn,U, Pn,Gn, Hn) and the associated holding time function At„ that 
consistently approximates the sytem in Eq. ([l]). In particular, given a state z £ Sn and a holding 
time Atn{z), we can implicitly define the stage cost function Gn{z, v) = Atn{z)g{z, v) for all v £ U 
and terminal cost function Hn{z) = h{z). We also associate with z £ Sn a cost value Jn{z), and 
a control fJ-n{z). We refer to J„ as a cost value function over Sn- In the following discussion, we 
describe how to construct Sn, Pn, Jn, fJ-n over iterations. We note that, in most cases, we only need 
to construct and access P„ on demand. 

In every iteration of the main loop (Lines|4 16 ) , we sample an additional state from the boundary 
of the state space S. We set J„, At„ for those states at Line [5] Subsequently, we also sample a 

■^This procedure is used in the algorithm solely for the purpose of inheriting the "rapid exploration" property of 
the RRT algorithm [TilfTT] . 
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Algorithm 1: iMDP() 


1 {n,So, 


Jo,;Uo,Ato) ^ (1,0,0,0,0); 


2 while 


n < do 


3 


{Sni Jni ^^ni ^tn) ^ ('S'n— 1 ) <^n— 1 j /^n— 1 ; -^^n— 1 ) j 




// 


Add a new state to the boundary 


4 




<r- SampleBoundaryO; 


5 


{Sn, Jn{Zs), ^J-niZs), Atn(Zs)) ^ {Sn U {Zs} , h{Zs) , Uull, 0) ; 




// 


Add a new state to the interior 


6 


Zs 


(r- Sample(); 


7 


^nearest ^ Nearest(2s) Sn, 1); 


8 


if 


(xnew,iinew,T") ^ ExtendBackwards(2;ncarest, ^^s, ?o) then 


9 




Znew ^ ^neui(O), 


10 




cost — T gl^ZYicvj , Uyicw) ~I~ </ri(2!nearcst) , 


11 




('S'n, "^nl-^new) , /^n(2^new) , (^ncw) ) ^ ('S'n U {Zncw}, COSt, U^gui, i 






// Perform L„ > 1 (asynchronous) value iterations 


12 




for i = 1 — 7- L„ do 






// Update Znew and Kn = 0( 5„^) states (O < 6* < 1, if„ < 5„ ) 


13 




■^update ^ Nearest(2;new, 5„\95„, -fC„) U {Znew}; 


14 




for Z £ ^update do 


15 




[_ Update(z,5„, J„,/i„, At„); 


16 


n i 


- n+ 1; 









state from the interior of S (Line [6]) denoted as Zg. We compute the nearest state 2;nearest, which is 
aheady in the current MDP, to the sampled state (Line^. The algorithm computes a trajectory 
that reaches ^nearest starting at some state near z^ (Linejs]) using a control signal iinew(O..T). The 
new trajectory is denoted by Xnew : [0,t] — )• S and the starting state of the trajectory, i.e., Xnew(O), 
is denoted by Znew The new state z^^w is added to the state set, and the cost value Jn{znew), 
control /in (-Znew), and holding time Atn{znew) are initialized at Line 11 



Update of cost value and control 



The algorithm updates the cost values and controls of the finer MDP in Lines T3][T5, We perform 
Ln ^ 1 value iterations in which we update the new state Znew 

and other Kn = e(|5„|^) states 

in the state set where Kn < \Sn\- When all states in the MDP are updated, i.e. Kn + 1 = \Sn\, 
Ln value iterations are implemented in a synchronous manner. Otherwise, Ln value iterations are 
implemented in an asynchronous manner. 



The set of states to be updated is denoted as .^update (Line 13). To update a state z G ^update 
that is not on the boundary, in the call to the procedure Update (Line 15), we solve the following 
Bellman equation]^ 



J„(z) =min{G„(z,t;) + a^*"(^)Ep„ [Jn-i{y)\z,v]}, 



(9) 



Although the argument of Update at Line 15 is J„, we actually process the previous cost values Jn-i due to 
Line|3] We can implement Linejsjby simply sharing memory for (S'n, -/n, ^^n) and (tSn— i, i, /J-n— i? At^— i)- 
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Algorithm 2: Update(z G Sn, Sn, Jn, fJ-n, '^t^) 


1 r 


ComputeHoldingTime(2:, 5'„ ); 


// Sample or discover C„ = 0(log( 5^ )) controls 


2 Un ^ ConstructCoiitrols(C„, z, Sn, t); 


3 for V S Un do 


4 


(Zncar,Pn) ^ ConiputeTranProb(2;, T, S^) ; 


5 


J ^ Tg{z, V) + EyeZnear Pn{y)Jn{y)] 


6 


if J < J„(2:) then 


7 


|_ {Jn{z),lXn{z),/S.tn{z),Kn{z)) ^ {J,V ,T,\Sn\)] 



Algorithm 3: Policy(z G 5, n) 

1 2:ncarcst ^ Nearest(2;, S'n, 1); 

2 return = (/U„(Znearcst), At„(2nearest)) 



and set ^n{z) = v*{z), where v*{z) is the minimizing control of the above optimization problem. 
There are several ways to solve Eq. ^ over the the continuous control space U efficiently. If 
Pn{-\z,v) and g{z,v) are affine functions of v, and U is convex, the above optimization has a 
linear objective function and a convex set of constraints. Such problems are widely studied in 
the literature [25]. More generally, we can uniformly sample the set of controls, called Un, in the 
control space U. Hence, we can evaluate the right hand side (RHS) of Eq. ^ for each f G [/„ to 
find the best v* in Un with the smallest RHS value and thus to update Jn{z) and /in(-z)- When 
lim„_>oo |f^n| = oo, we can solve Eq. ^ arbitrarily well (see Theorem [s]). 

Thus, it is sufficient to construct the set Un with 0(log(| S'nD) controls using the procedure 
ConstructControls as described in Algorithm [2] (Line [2]). The set .^near and the transition 
probability Pn{- \ z,v) constructed consistently over the set Znear are returned from the procedure 
ComputeTranProb for each v ^ Un (Line|4]). Depending on a particular method to build Pn (i.e. 
solving a system of linear equations or evaluating a local Gaussian distribution), the cardinality 
of Znear IS Set to a Constant or increases as ©(logdS'nD). Subsequently, the procedure chooses the 
best control among the constructed controls to update Jn{z) and fin{z) (Line [7]). We note that in 
Algorithm [2| before making improvement for the cost value at z by comparing new controls, we can 
re-evaluate the cost value with the current control finiz) over the holding time At„(z) by adding 
the current control ^n{z) to The reason is that the current control may be still the best control 
compared to other controls in Un- 

Complexity of iMDP 

The time complexity per iteration of the implementation in Algorithms [lj|2] is either Od^nl^ log \Sn{) 
or 0(|S'„|^(log |S'„|)'^). In particular, if the procedure ComputeTranProb solves a set of linear equa- 
tions to construct P„ such that the cardinality of .^near can remain constant, the time complexity 
per iteration is 0(|S„|^ log where log|S'n| accounts for the number of processed controls, and 
\Sn\^ accounts for the number of updated states in one iteration. Otherwise, if the procedure 
ComputeTranProb uses a local Gaussian distribution to construct P„ such that the cardinality of 
Zncar increases as ©(loglSnl), the time complexity per iteration is ©([^^[^(log |S'„|)^) . The pro- 
cessing time from the beginning until the iMDP algorithm stops after n iterations is thus either 
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0(|S'„|-^"'"^ log IS^I) or 0(|S'„|^"'"^(log |5'„|)^) . Since we only need to access locally consistent transi- 
tion probability on demand, the space complexity of the iMDP algorithm is 0(|S'n|). Finally, the 
size of state space 5„ is = Q{n) due to our sampling strategy. 

4.3 Feedback Control 

As we will see in Theorems [7][8j the sequence of cost value functions Jn arbitrarily approximates the 
original optimal cost-to-go J*. Therefore, we can perform a Bellman update based on the approxi- 
mated cost-to-go Jn (using the stochastic continuous-time dynamics) to obtain a policy control for 
any n. However, we will discuss in Theorem [9] that the sequence of /i„ also approximates arbitrarily 
well an optimal control policy. In other words, in the iMDP algorithm, we also incrementally con- 
struct an optimal control policy. In the following paragraph, we present an algorithm that converts 
a policy for a discrete system to a policy for the original continuous problem. 

Given a level of approximation n G N, the control policy fj,n generated by the iMDP algorithm 
is used for controlling the original system described by Eq. ([T]) using the procedure given in Al- 
gorithm [Sj This procedure computes the state in Ai^ that is closest to the current state of the 
original system and applies the control attached to this closest state over the associated holding 
time. 

5 Analysis 

In this section, let {Ain = {Sn,U, Pn,Gn, Hn), Atn, Jn, IJ-n) denote the MDP, holding times, cost 
value function, and policy returned by Algorithm [l] at the end n iterations. The proofs of lemmas 
and theorems in this section can be found in Appendix. 

For large n, states in Sn are sampled uniformly in the state space S pTfj . Moreover, the 
dispersion of 5"^ shrinks with the rate 0((log |S'n|/|5„|)^/'^^) as described in the next lemma. 

Lemma 4 Recall that C,n measures of the dispersion of Sn (Eq. [^. We have the following event 
happens with probability one: 

Qn = 0{{\og\Sn\/\Sn\)^'^^). 

The proof is based on the fact that, if we partition M*^^ into cells of volume O (log(|5'„|)/|S'„|), then, 
almost surely, every cell contains at least an element of 5„, as approaches infinity. The above 
lemma leads to the following results. 

Lemma 5 The MDP sequence {A^njS^o ^'^'^ holding times {Ai„}5^g returned by Algorithm^ are 
locally consistent with the system described by Eq. ([T]) for large n with probability one. 

Theorem [T] and Lemma [5] together imply that the trajectories of the controlled Markov chains 
approximate those of the original stochastic dynamical system in Eq. ([T]) arbitrarily well as n 
approaches to infinity. Moreover, recall that || • is the sup-norm over Sn, the following theorem 
shows that J* converges uniformly, with probability one, to the original optimal value function J* . 

Theorem 6 Given n E N, for all z G Sn, Jni^) denotes the optimal value function evaluated at 
state z for the finite-state MDP Ain returned by Algorithm^ Then, the following event holds with 
probability one: 

lim ||J*- J*||5,^ =0. 

n—¥oo 
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In other words, J* converges to J* uniformly. In particular, 

\\j:-r\\s„ = o{iiog\Sn\/\Sn\r/''^). 

The proof follows immediately from Lemmas [4]j5] and Theorems [2]j3j The theorem suggests that 
we can compute J* for each discrete MDP Ain before sampling more states to construct Ain+i- 
Indeed, in Algorithm [T| when updated states are chosen randomly as subsets of 5„, and is large 
enough, we compute J* using asynchronous value iterations j26l|27]. Subsequent theorems present 
stronger results. 

We will prove the asymptotic optimality of the cost value J„ returned by the iMDP algorithm 
when n approaches infinity without directly approximating J* for each n. We first consider the 
case when we can solve the Bellman update (Eq. |9| exactly and 1 < Ln, Kn = Q){\Sn\^) < \Sn\- 

Theorem 7 For all z G 5"^, Jn{z) is the cost value of the state z computed by Algorithm^ and 
Algorithm^ after n iterations with 1 < Ln, and Kn = Q{\Sn\^) < \Sn\- Let Jn,fi„ be the cost-to-go 
function of the returned policy /i„ on the discrete MDP Mn- If the Bellman update at Eq. is 
solved exactly, then, the following events hold with probability one: 

i. lim„_^oo \\Jn - Jn\\s„ = 0, and lim„^oo 11-4 - J*\\Sn = 0, 

a. lim„^oo \Jn,fM„{z) - J*{z)\ =0, \/z £ Sn- 

Theorem [7] enables an incremental computation of the optimal cost J* without the need to com- 
pute J* exactly before sampling more samples. Moreover, cost-to-go functions Jn,^„ induced by 
approximating policies /i„ also converges pointwise to the optimal cost-to-go J* with probability 
one. 

When we solve the Bellman update at Eq. [9] via sampling, the following result holds. 

Theorem 8 For all z G Sn, Jn{z) is the cost value of the state z computed by Algorithm^ and 
Algorithm^ after n iterations with 1 < Ln, and Kn = Q{\Sn\^) < \Sn\. Let Jn,ii„ be the cost-to-go 
function of the returned policy fin on the discrete MDP Ain- If the Bellman update at Eq. is 
solved via sduiplifig such that lini^^Qo = oo, then 

i. \\Jn — JnWsn converges to in probability. Thus, Jn converges uniformly to J* in probability, 

a. lim„_!.oo \Jn,fj,„iz) — J*{z)\ =0 for all z £ Sn with probability one. 

We emphasize that while the convergence of Jn to J* is weaker than the convergence in Theorem [7| 
the convergence of Jn,/i„ to J* remains intact. Importantly, Theorem [T] and Theorems [TjjS] together 
assert that starting from any initial state, trajectories and control processes provided by the iMDP 
algorithm approximate arbitrarily well optimal trajectories and optimal control processes of the 
original continuous problem. More precisely, with probability one, the induced random probability 
measures of approximating trajectories and approximating control processes converge weakly to 
the probability measures of optimal trajectories and optimal control processes of the continuous 
problem. 

Finally, the next theorem evaluates the quality of any-time control policies returned by Algo- 
rithm [3l 

Theorem 9 Let Jin ■ S ^ U be the interpolated policy on S of fin '■ Sn U as described in 
Algorithm^ 

\/z £ S : Jiniz) = Uniyn) whcrc yn = argmin^,^sj\z' - z\\2. 
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Then there exists an optimal control policy fi* of the original problen^ so that for all z £ S: 

lim ]I^{z) = fi*{z) w.p.l, 

i/ /i* is continuous at z. 

6 Experiments 
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Figure 1: Results of iMDP on a stochastic LQR problem. Figure 1(a) shows the convergence of 
approximated cost-to-go to the optimal analytical cost-to-go over iterations. Anytime solutions are 
compared to the analytical optimal solution after 200 and 600 iterations in Figs. [I(b)p(c)| Mean 
and 1-cj interval of the error ||t/n — '-^^lls'n shown in |l(d)] using 50 trials. The corresponding 



mean and standard deviation of the error || J„ — J*!!^^ are depicted on a log-log plot in Fig. l(e)[ 
In Fig. 1(f) we plot the ratio of \\Jn — J*\\s„ to (log(|S'„|)/|5„|)'^'^ to show the convergence rate of 

l^-^logdS-nl). Ratios 



*. Figure 


1(g) 




are f 



We used a computer with a 2.0- GHz Intel Core 2 Duo T6400 processor and 4 GB of RAM to 
run experiments. In the first experiment, we investigated the convergence of the iMDP algorithm 
on a stochastic LQR problem: inf E[ £ 0.95^{3.5x{tf + 200u{tf}dt + h{x{T))] such that dx = 



'otherwise, an optimal relaxed control policy m* exists [TS|, and jj.^ approximates m* arbitrarily well. 
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(a) Policy after 500 iterations (0.5s). (b) Policy after 1,000 iterations (1.2s). (c) Policy after 2,000 iterations (2.1s). 
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(g) Policy after 4,000 iterations (7.6s). (h) Policy with 10,000 nodes (28s). (i) Policy after 20,000 iterations (80s) 




(j) Contour of J4,( 



(k) Contour of Jio,o 



(1) Contour of J20,o 



Figure 2: A system with stochastic single integrator dynamics in a cluttered environment. With 
appropriate cost structure assigned to the goal and obstacle regions, the system reaches the goal in 
the upper right corner and avoids obstacles. The standard deviation of noise in x and y directions 
is 0.26. The maximum velocity is one. Anytime control policies and corresponding contours of 



approximated cost-to-go as shown in Figs. 2(a) 2(1) indicate that iMDP quickly explores the state 
space and refines control policies over time. 
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(a) Noise-free: 1,000 iterations(1.2s). (b) Stochastic: 300 iterations (0.4s). (c) Stochastic: 1,000 iterations (1.1s). 



Figure 3: Performance against different process noise magnitude. Tlie system starts from (0,-5) to 
reach the goal. In Fig. 3(a), the environment is noise-free. In Figs. 3(b) 3(c)[ standard deviation 
of noise in x and y directions is 0.37. In the latter, the system first discovers an unsafe route that 
is prone to collisions and discovers a safer route after a few seconds. (In Fig. |3(b)[ we temporarily 
let the system continue even after collision to observe the entire trajectory.) 



{3x-\-llu)dt+\/0.2dw on the state space S = [—6, 6] where r is the first hitting time to the boundary 
dS = {—6, 6}, and h{z) = 414.55 for z G dS and otherwise. The optimal cost-to-go from x{0) = z 
is 10.392:^-1-40.51, and the optimal control policy is u{t) = — 0.5714x(i). Since the cost-rate function 
is bounded on S and Holder continuous with exponent 1.0, we use p = 0.5. In addition, we choose 
6 = 0.5, and ? = 0.99 in the procedure ComputeHoldingTime. We used the procedure Update as 
presented in Alogrith m [2] w ith log(n) sampled controls and transition probabilities having constant 
support size. Figures 1(a) 1(c) show the convergence of approximated cost-to-go, anytime controls 
and trajectory to the optimal analytical counterparts over iterations. We observe that in Fig. 
both the mean and variance of cost-to-go error decreases quickly to zero. The log-log plot in 

Jn ' 



1(d) 



Fig. 



1(e) clearly indicates that both mean and standard deviation of the error 



J* 



continue 



to decrease. This observation is consistent with Theorems 7][8, Moreover, Fig. 1(f) shows the ratio 
of II J„ — J*||5„ to (log(|5„|)/|5„|)'^'^ indicating the convergence rate of Jn to J*, which agrees with 
Theorem [sj Finally, Fig. 1(g) plots the ratio of running time per iteration r„ to |S'n|'^'^ logdS^I) 
asserting that the time complexity per iteration is ©(IS'nl"'^ logdS^D) . 

In the second experiment, we controlled a system with stochastic single integrator dynamics 
to a goal region with free ending time in a cluttered environment. The cost objective function is 
discounted with a = 0.95. The system pays zero cost for each action it takes and pays a cost of -1 
when reaching the goal region Xgoai- The maximum velocity of the system is one. The system stops 
when it collides with obstacles. We show how the system reaches the goal in the upper right corner 
and avoids obstacles with different anytime controls. Anytime control policies after up-to 2,000 
iterations in Figs. |2(a)|2(c)[ which were obtained within 2.1 seconds, indicate that iMDP quickly 
explores the state space and refines control policies over time. Corresponding contours of cost value 
functions are shown in Figs. 2(d) 2(f) further illustrate the refinement and convergence of cost value 
functions to the original optimal cost-to-go over time. We observe that the performance is suitable 
for real-time control. Furthermore, anytime control policies and cost value functions after up-to 
20,000 iterations are shown in Figs. 2(g) 2(i) and Figs. 2(j) 2(1) respectively. We note that the 
control policies seem to converge faster than cost value functions over iterations. The phenomenon 
is due to the fact that cost value functions J„ are the estimates of the optimal cost-to-go J*. Thus, 
when Jn{z) — J*{z) is constant for all z G Sn, updated controls after a Bellman update are close 
to their optimal values. Thus, the phenomenon favors the use of the iMDP algorithm in real-time 
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(a) Trajectory snapshots after 3000 iterations (15.8s). 
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Figure 4: Results of a 6D manipulator example. The system is modeled as a single integrator with 
states representing angles between segments and the horizontal line. Control magnitude is bounded 



by 0.3. The standard deviation of noise at each joint is 0.032 rad. In Fig. |4(a) the manipulator 



is controlled to reach a goal with the final upright position. In Fig. |4(b)| the mean and standard 
deviation of the computed cost values for the initial position are plotted using 50 trials. 



applications where only a small number of iterations are executed. 

In the third experiment, we tested the effect of process noise magnitude on the solution trajec- 
tories. In Figs. 3(a) 3(c)[ the system wants to arrive at a goal area either by passing through a 
narrow corridor or detouring around the two blocks. In Fig. 3(a)[ when the dynamics is noise-free 
(by setting a small diffusion matrix), the iMDP algorithm quickly determines to follow a narrow 
corridor. In contrast, when the environment affects the dynamics of the system (Figs. 3(b) 3(c)), 
the iMDP algorithm decides to detour to have a safer route. This experiment demonstrates the 
benefit of iMDP in handling process noise compared to RRT-like algorithms |14| I17j. We empha- 
size that although iMDP spends slightly more time on computation per iteration, iMDP provides 
feedback policies rather than open-loop policies; thus, re-planning is not crucial in iMDP. 

In the forth experiment, we examined the performance of the iMDP algorithm for high dimen- 
sional systems such as a manipulator with six degrees of freedom. The manipulator is modeled as 
a single integrator where states represents angles between segments and the horizontal line. The 
maximum control magnitude for all joints is 0.3. The standard deviation of noise at each joint is 
0.032 rad. The manipulator is controlled to reach a goal with the final upright position in minimum 



time. In Fig. 4(a), we show a resulting trajectory after 3000 iterations computed in 15.8 seconds. 
In addition, we show the mean and standard deviation of the computed cost values for the initial 
position using 50 trials in Fig. |4(b)[ As shown in the plots, the solution converges quickly after 
about 1000 iterations. These results highlight the suitability of the iMDP algorithm to compute 
feedback policies for complex high dimensional systems in stochastic environments. 



7 Conclusions 

We have introduced and analyzed the incremental sampling-based iMDP algorithm for stochastic 
optimal control. The algorithm natively handles continuous time, continuous state space as well 
as continuous control space. The main idea is to consistently approximate underlying continuous 
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problems by discrete structures in an incremental manner. In particular, we incrementally build 
discrete MDPs by sampling and extending states in the state space. The iMDP algorithm refines 
the quality of anytime control policies from discrete MDPs in terms of expected costs over iterations 
and ensures almost sure convergence to an optimal continuous control policy. The iMDP algorithm 
can be implemented such that its time complexity per iteration grows as 0(/c^logA;) withO < < 1 
leading to the total processing time 0(/c^^^ log /c), where k is the number of states in MDPs which 
increases linearly over iterations. Together with linear space complexity, iMDP is a practical 
incremental algorithm. The enabling technical ideas lie in novel methods to compute Bellman 
updates. 

Further extension of the work is broad. In the future, we would like to study the effect of biased- 
sampling techniques on the performance of iMDP. The algorithm is also highly parallelizable, and 
efficient parallel versions of the iMDP algorithm are left for future study. Remarkably, Markov 
chain approximation methods are also tools to handle deterministic control and non-linear filtering 
problems. Thus, applications of the iMDP algorithm can be extended to classical path planning 
with deterministic dynamics. We emphasize that the iMDP algorithm would remove the necessity 
for exact point-to-point steering of RRT-like algorithms in path planning applications. In addition, 
we plan to investigate incremental sampling-based algorithms for online smoothing and estimation 
in the presence of sensor noise. The combination of incremental sampling-based algorithms for 
control and estimation will provide insights into addressing stochastic optimal control problems with 
imperfect state information, known as Partially Observable Markov Decision Processes (POMDPs). 
Although POMDPs are fundamentally more challenging than the problem that is studied in this 
paper, our approach differentiates itself from existing sampling-based POMDP solvers (see, e.g., |28| 
129] ) with its incremental nature and computationally-efficient search. Hence, the research presented 
in this paper opens a new alley to handle POMDPs in our future work. 
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Appendix 

A Notations and Preliminaries 

We denote N as a set of natural numbers and M as a set of real numbers. A sequence on a set X 

is a mapping from N to X, denoted as {x„}^g, where ,x„ G X for each n G N. Given a metric 
space X endowed with a metric d, a sequence {x„}^q C X is said to converge if there is a point 
X G X, denoted as lim^^oo^Jn; with the following property: For every e > 0, there is an integer 
N such that n > N implies that d{xn, x) < e. On the one hand, a sequence of functions {/n}^i 
in which each function is a mapping from X to M converges pointwise to a function / on X if 
for every x £ X, the sequence of numbers {/n(x)}^Q converges to f{x). On the other hand, a 
sequence of functions {fn}'^=i converges uniformly to a function / on X if the following sequence 
{Mn I Mn = sup^gx \fnix) - /(x)|}^=o Converges to 0. 

Let us consider a probability space (^2, P) where J7 is a sample space, J-" is a a-algebra, and 
P is a probability measure. A subset ^4 of J-" is called an event. The complement of an event A is 
denoted as A"^. Given a sequence of events {An}^^Q, we define limsup^^o^ An as Hj^g ^k^n 
i.e. the event that An occurs infinitely often. In addition, the event lim inf„^oo A^ is defined as 
U^o l^fc^n ^k- A random variable is a measurable function mapping from J7 to M. The expected 
value of a random variable Y is defined as E,[Y] = YdF. A sequence of random variables {l^j^g 
converges surely to a random variable Y if lim„^oo^('^) = Y{u}) for all uj £ Q. A sequence of 
random variables {i^jj^g converges almost surely or with probability one (w.p.l) to a random 
variable Y if P(a; G | lim„_^oo^(w) = = 1- Almost sure convergence of {l^}^g to 

Y is denoted as Yn Y. We say that a sequence of random variables {Yn}'^^Q converges in 
distribution to a random variable Y if lim„^oo -^n(3;) = F(x) for every x G M at which F is 
continuous where {Fn}^^Q and F are the associated CDFs of {i^}^o ^^'^ ^ srespectively. We 
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denote this convergence as Yn — )■ y. Convergence in distribution is also called weak convergence. If 
Yn -^Y, then lim„_>>oo = E[/(y)] for all bounded continuous functions /. As a corollary, 

when {Yn}^^Q converges in distribution to 0, and y„ is bounded for all n, we have lim^^oo IE[y„] = 
and lim„_^oo IE[y^] = 0, which together imply lim„_j. oo ^or(y„) = 0. We say that a sequence of 
random variables {y^lj^o converges in probability to a random variable Y, denoted as Yn A y, 
if for every e > 0, we have limn^oo^i\Xn — X\ > e) = 0. For every continuous function /(•), if 
y„ 4 y, then we also have /(y„) A f{Y). If y„ A y and Z„ 4 Z, then (y„, Z„) 4 {Y, Z) . If 

\Zn — Yn\ 4 and Yn Y, we have Zn —> Y. Finally, we say that a sequence of random variables 
{y„}^o converges in r*'^ mean to a random variable Y, denoted as y„ — )• Y, if E[|Xn|''] < oo for all 
n, and lim„_!.oo Xl*"] = 0. We have the following implications: (i) almost sure convergence or 

^th j^gg^jj convergence (r > 1) implies convergence in probability, and (ii) convergence in probability 
implies convergence in distribution. The above results still hold for random vectors in higher 
dimensional spaces. 

Let f{n) and g{n) be two functions with domain and range N or M. The function f{n) is called 
0{g{n)) if there exists two constants M and no such that f{n) < Mg{n) for all n > uq. The 
function f{n) is called Q{g{n)) if g{n) is 0{f{n)). Finally, the function f{n) is called Q{g{n)) if 
/(n) is both 0{g{n)) and n(g{n)). 



B Proof of Lemma [4] 

For each n G N, divide the state space S into grid cells with side length 1/27.^ (log |5„|/|S'„|)^/'^"' as 
follows. Let Z denote the set of integers. Define the grid cell i £ Z*^^ as 



1 /log|5„|V/'^- 1 ^log|S„|^^/''- 



where [— a, a]'^^ denotes the d^-dimensional cube with side length 2 a centered at the origin. Hence, 
the expression above translates the d^-dimensional cube with side length (1/2) 7r(log |5„|/|S'„|)-'^/'^"' 
to the point with coordinates i ^{logn/n)^^'^'' . 

Let Qn denote the indices of set of all cells that lie completely inside the state space 5, i.e., 
Qn = G Z'^ : Wn{i) ^ S}. Clearly, Qn is finite since S is bounded. Let dQn denote the set of all 
grid cells that intersect the boundary of S, i.e., dQn = {i E Z"^ : Wn{i) H dS 7^ 0}. We claim for all 
large n, all grid cells in Qn contain one vertex of Sn, and all grid cells in dQn contain one vertex 
from dSn.. First, let us show that each cell in Qn contains at least one vertex. Given an event A, let 
A'^ denote its complement. Let An^k denote the event that the cell Wn{k), where k G Qn contains a 
vertex from Sn, and let An denote the event that all grid cells in Qn contain a vertex in Sn- Then, 
for all k G Qn, 

F K.) = (1 - ^)""' ^ (-((7./2)^V-(5)) logl^nl) = |5.r(>/^)^^/-(^), 

where m{S) denotes Lebesgue measure assigned to S. Then, 



^ ______ -(7.72)'^- /m(S) 

where the first inequality follows from the union bound and | Q„ | denotes the cardinality of the set 
Qn- By calculating the maximum number of cubes that can fit into S, we can bound \Qn\- 

m{S) m{S) \Sn\ 



\Qn\ < 



(^^/2)ci. (7./2)'^Mog|5n| 
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Note that by construction, we have jS'nl = G(n). Thus, 

^^^"^ - (7./2)^^ log|S„| l^'-l - (7./2)^^ log|5„| l^'-l 

which is summable for ah 7^ > 2 (2 m(S'))^/'^^ . Hence, by the Borel-Cantelh lemma, the probability 
that occurs infinitely often is zero, which implies that the probability that An occurs for all 
large n is one, i.e., P(lim inf „_>oo ^n) = 1- 

Similarly, each grid cell in dQn can be shown to contain at least one vertex from dSn for all 
large n, with probability one. This implies each grid cell in both sets Qn and dQn contain one 
vertex of Sn and dSn, respectively, for all large n, with probability one. Hence the following event 
happens with probability one: 

Cn = max min \\z' - z\\2 = 0{{\og\Sn\/\Sn\f'''''). 

z&S„ z'&S 

□ 



C Proof of Lemma [5] 

We show that each state that is added to the approximating MDPs is updated infinitely often. 
That is, for any z G Sn, the set of all iterations in which the procedure Update is applied on z is 
unbounded. Indeed, let us denote Cn{z) = min^/g^^ \ \z' — z\\2. From Lemma [4| lim„_^oo Cn{z) = 
happens almost surely. Therefore, with probability one, there are infinitely many n such that 



Cn{z) < Cn-i(-z) • In other words, with probability one, we can find infinitely many Znew at Line 13 



of Algorithm [T] such that z is updated. For those n, the holding time at z is recomputed as 

/log 1 5 ix^'jp/i^a; n 

Atn{z) = 7t ( |g I ) at Line 1 of Algorithm 2 Thus, the following event happens with 

probability one: 

lim Atn{z) = 0, 

n— >oo 

which satisfies the first condition of local consistency in Eq. [3j 

The other conditions of local consistency in Eqs. 4]|6 are satisfied immediately by the way that 



the transition probabilities are computed (see the description of the ComputeTranProb procedure 



given in Section 4.1). Hence, the MDP sequence {Mn}^=o holding times {At„}^Q are locally 



consistent for large n with probability one. □ 



D Proof of Theorem [T] 

To highlight the idea of the entire proof, we first prove the convergence under synchronous value 
iterations before presenting the convergence under asynchronous value iterations. As we will see, 
the shrinking rate of holding times plays a crucial role in the convergence proof. The outline of the 
proof is as follows. 

SI: Convergence under synchronous value iterations: In Algorithm [T| we take -Ln > 1 and Kn = 
ISnl — 1. In other words, in each iteration, we perform synchronous value iterations. Moreover, 
we assume that we are able to solve the Bellman equation (Eq. [9| exactly. We show that J„ 
converges uniformly to J* almost surely in this setting. 
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S2: Convergence under asynchronous value iterations: When Kn = 0(|5„|^) < we only 

update a subset of Sn in each of Ln passes. We show that J„ still converges uniformly to J* 
almost surely in this new setting. 

In the following discussion and next sections, we need to compare functions on different do- 
mains Sn- To ease the discussion and simplify the notation, we adopt the following interpolation 
convention. Given X G Y and J : X — )■ M, we interpolate J to J on the entire domain Y via 
nearest neighbor value: 

\/y £ Y : J{y) = J(z) where z = argminygj^||z' — y\\- 

To compare J : X — )• M and J' : y — )• M where X,Y C S, we define the sup-norm: 

II 7" _ j'w — 117 — T'll 

where J and J' are interpolations of J and J' from the domains X and Y to the entire domain S 
respectively. In particular, given Jn : Sn ^ K, and J : — )• M, then ||Jn — -^llsn < ||>^n — -^Hoo- 
Thus, if II Jn — J||oo approaches when n approaches oo, so does || Jn — JH^^. Hence, we will work 
with the (new) sup- norm 1 1 • 1 1 qq instead of 1 1 * 1 1 in the proofs of Theorems |7||8| The triangle 
inequality also holds for any functions J, J', J" defined on subsets of S with respect to the above 
sup-norm: 

ll'-^ ^ l|oo — W'^ J l|oo ~t~ W'J J ||oo* 

Let B{X) denote a set of all real- valued bounded functions over a domain X. For 5„, C Sn' 
when n < n' , a function J in B{Sn) also belongs to B(Sn'), meaning that we can interpolate J on 
Sn to a function J' on 5„/. In particular, we say that J in B{Sn) also belongs to B{S). 

Lastly, due to random sampling, Sn is a random set, and therefore functions J„ and J* defined 
on Sn are random variables. In the following discussion, inequalities hold surely without further 
explanation when it is clear from the context, and inequalities hold almost surely if they are followed 
by "w.p.l". 

SI: Convergence under synchronous value iterations 

In this step, we first set L„ > 1 and Kn = IS'nl — 1 in Algorithm [T] Thus, for all z £ Sn, 

/log 15 \\^^P/'^^ 

the holding time At„(z) equals jt ( | g | ) is denoted as At„. We consider the MDP 

^An = {Sn,U, Pn,Gn, Hn) at n^^ iteration and define the following operator Tn : B{Sn) — )• B{Sn) 
that transforms every J G B(Sn) after a Bellman update as: 

TnJ{z) = mm{Gn{z,v)+a'^'-EpJJ{y)\z,v]}, ^z e Sn, (10) 

assuming that we can solve the minimization on the RHS of Eq. [TO] exactly. For each k > 2, 
operators are defined recursively as = TnT!^~^ and = Tn- When we apply r„ on 
J G B{Sk) where k < n, J is interpolated to Sn before applying T„. Thus, in Algorithms [l]j2| we 
implement the next update 

7 _ rpLn J 

'Jn — ^n '^n—l- 



From [26], we have the following results: J* = TnJn, and T„ is a contraction mapping. For any J 
and J' in B{Sn), the following inequality happens surely: 



|r„j-r„j'||oo <a'^*1|J-/ 
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Combining the above results: 



17* Til II T^in 7* T^in 7 I | ^ ru^n Atn | | 7* 7 J 

\'^n "^nlloo — IKn n n "^Ji— l||oo ^ " ll"^n «^n— 1| 



< a^*"(||J* <^ri-ll|oo + ||<^n-l '^n- 



l\\oo) 



where the second inequality follows from the triangle inequality, and L„ > 1, a G (0, 1) 
Thus, by iterating over n, for any > 1 and n > N, we have: 

II J: - Jnlloo < An + aAt„+At„_i...+Ai^.+i|| _ 

where An are defined recursively: 

An = a^'HWJ: - J:_i||oo +^n-l), \/n>N + l, 

= 0^*^^+111 jiV+i-JiViioo. 

Note that for any > 1: 



lim Atn + Atn-l--- + AtN+l = oo, 
n— >-oo 



due to the choice of holding times At„ in the procedure ComputeHoldingTime. Therefore, 

lim a^*"+-+^*^+i||JiV - Jnlloo = 0. 

n—^co 

By Theorem rol the following event happens with probability 1 (w.p.l): 



lim II J* - J*||oo =0, 

n— >oo 

hence, 

lim 1 1 J* - J*_i||oo = w.p.l. 

n— >oo 

Thus, for any fixed e > 0, we can choose N large enough such that: 

ll'^n ~ '^n-illocT'" < f w.p.l for all n > N, and 
^At„+...+At^+i|| - JtvIIoo < e surely, 

where ? G (0, 1) is the constant defined in the procedure ComputeHoldingTime. 



Now, for all n > N, we rearrange Eqs 12 13 to have 

An < eB n W.p.l, 

where 

Bn = a^'-m: - J:_iI|^ +i?„-i), Vn > iV + l, 

BN+, = a^'^+^\\r^^,-r^\\l,. 

We can see that for n > N + 1: 

Bn = a^*"(P: - J:_i||^ + < e^/(i-^) + w.p.l, 

Bm+1 = a^*^+i|| JiV+i - JUL < e'/^'"'^ w.p.l. 
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We now prove that almost surely, Bn is bounded for all n > N. Indeed, we derive the conditions 
so that Bn-i < Bn or Bn-i > Bn as follows: 

Bn-l < Bn 

^ i?„_i<a^*"(iij:-j:_iiiL+i?n-i) 

a' 



Bn-l < 



^Atn I I 7* _ 7* IK 

\'^n "^n— llloo 



1 — a^*" 



Bn-l < fC W.p.l. 



The last inequality is due to Theorem |6] and jS^I = @{n), |5'„_i| = Q{n — 1): 



I 7* 7* 1 1 
I'^n ~ '^n-ll loo 



0((log|5„_i|/|S„_i|)^/'^-) </C 



log |'S'„ 

Sn 



p/dx 



W.p.l, 



for large n where tC is some finite constant. Let /3 = a'^' G (0, 1). For large n, ^"^'^"^ are in (0, 1) 
and 6 G (0, 1]. Let us define 



I , and Vn 



Sn 



Then, x„ > y„ > 0. The above condition is simplified to 



log \Sn 
Sn 



Consider the function r : [0, oo) — )• M such that r{x) = i_px , we can verify that r(x) is non- 
increasing and is bounded by r(0) = — l/log(/3). Therefore: 



Bn-l < Bn =^ Bn-l < 



JC 



JC 



log(/3) 7t log(a) 



w.p.l. 



Or conversely. 



Bn-l > 



/C 



It log(a) 



w.p.l 



Bn-l > Bn W.p.l. 



(18) 



(19) 



The above discussion characterizes the random sequence Bn- In particular, Fig.[5]shows a possible 
realization of the random sequence Bn for n > N. As shown visually in this plot, -Bat+i is less 



than e''/^^ w.p.l and thus is less than e"^^^ 

we have already shown that Bn-i is bounded from above by e^/^^^*"-* 



It log(«) 



^ w.p.l. For n > + 1, assume that 

w.p.l. When 



Bn-l > i-^(a) w.p.l, the sequence is non-increasing w.p.l. Conversely, when the sequence 

and the 



It log(a) 



It log{a) 

is increasing, i.e. Bn-i < Bn, we assert that Bn-i < 
increment is less than e'^/^^"'^-' due to Eq. 16 



w.p.l due to Eq. 



?) , w.p.l. Hence, from Eqs. 

7t log(o) ^ ' ^ 



18 



16 



Jt log(a) 

n both cases, we conclude that Bn is also bounded by 
19l we infer that Bn is bounded w.p.l for all n > N: 



Bn < e^/(^-^) 



JC 

It log(a) 



w.p.l. 
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Figure 5: A realization of the random sequence B^- We have B^+i less than e'^/*^^"'^-' w.p.l. 
For n larger than + 1, when Bn-i > — \og(^a) '^•P-l; the sequence is non-increasing w.p.l, 
i.e. Bn~i > Bn w.p.l. Conversely, when the sequence is increasing, i.e. i?n-i < Bn, we have 
Bn-i < — iog{a) the increment is less than e''^^^~''\ Hence, the random sequence Bn is 

bounded by e''/'^-'^"'') r^V-r w.p.l. 



Thus, for all n > A: 



An<eBn<e[ e'/^^-'^ - ^ ) w.p.l. (20) 

7tlog(a) 



Combining Eqs. |llp5 , and 20, we conclude that for any e > 0, there exists N >1 such that for 
all n > A, we have 

It log (a) 



Therefore, 



Combining with 



we obtain 



lim \\J* - JnWoo = w.p.l. 



lim \\J* — J*\\oo = w.p.l, 
n— >oo 



lim \\Jn — J*\\oo = W.p.l. 
n— >oo 



/log 15 \\^'^p/<^x 

In the above analysis, the shrinking rate ( |^ | j of holding times plays an important role 
to construct an upper bound of the sequence Bn- This rate must be slower than the convergence 

/log IS \\P/'^=^ 

rate ( |g | j of J* to J* so that the function r(x) is bounded, enabling the convergence of 
cost value functions Jn to the optimal cost-to-go J* . Remarkably, we have accomplished this 
convergence by carefully selecting the range (0, 1) of the parameter ^. The role of the parameter 9 
in this convergence will be clear in Step S2. Lastly, we note that if we are able to obtain a faster 
convergence rate of J* to J* , we can have faster shrinking rate for holding times. 
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S2: Convergence under asynchronous value iterations 

When 1 < Ln and = e(|5„0 < | Sn\, we first claim the following result: 

Lemma 10 Consider any increasing sequence {n^j^Q as a subset of N such that no = and 
k<\Sn,\< k^'^. For J e B{S), we define: 



The following event happens with probability one: 



ni I loo 



+ ... + a'^'-'-WJL - .iu 



Jta,^({„,},t.)=0. 
Proof We rewrite A(^{nj}j^Q) = where An^, are defined recursively: 

An, =a^*"KII^4-^n,_J|oo+A„,_J, yk>K, (21) 

Ar,,=A{{nj}f^,), Vi^>l. (22) 

We note that 

Atn„ + Atn,_, + ... + Atn^ 

f loglSnj y/"^ flogjS^y/'^ ^log|S„,|^^^^/'^^ 



Cl / '/'•I |o I / i*'*i/t'i 1^ 



^ 7« TTT-r + 7t 77^ r + ... + 7t 



Q \ i \IC 1/ \IC I 

'-'"fcl/ VPrifc-il/ VPnifl 

1 1 111 1 

-^*^;^ + ^*(fc-i)?P/d. +- + ^*(x)?P/<icc - ^ ^^ri + - + 

where the second inequality uses the given fact that \Snf^\ < k^^^. Therefore, for any K > 1: 

lim a^*"*^^*"^-!-^^*"-^ =0. 

fc— >CXD 

We choose a constant g > 1 such that < 1. For any fixed e > 0, we can choose K large enough 
such that: 

\\Jn, - Jn,J\l^'' < e w.p.l for all k>K. (23) 

For all A; > we can write 

where 

Bn, = a^'-^ (II - J*^J\^^ + Bn,_,), Vfe > K, 

Furthermore, we can choose K' sufficiently large such that K' > K and for all k > K': 

a'^*"fc+-+^*™yl({nj}jLo) < e- 
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We obtain: 

An^ < eBn^ + e, > K' > K > 1. 
We can also see that for k > K: 

Br,, = - Jn,J\^ + 5n._J < e^/i'-^^ + w.p.l. (24) 

Similar to Step SI, we characterize the random sequence B^, as follows: 

a fc||J„^ -'rifc-illoo 

-Dnfc-i < ; AT 

1 - a^*"fc 



-Dni. 1 < A. 5 — t:; w.p.l. 



1 ^* 

1 — a 



Let /3 = a^t G (0, 1). We define: 



^fc = 1 \Q I , and yfc = — — 



We note that is a decreasing function for positive x. Since |'S'„j._J > — 1 and l-S^^,! < fc^/^, 
we have the following inequalities: 

(C^fY"' ^({\og{k-i)Y^^^/''^ 
xk > — -, — , yk< 



k J ' """-y {k-i)s 

Since 9 G (0, 1] and g > I, we can find a finite constant /Ci such that < Kix^ for large k. Thus, 
the above condition leads to 



Therefore: 



r> r> n /C/Cl /C/Cl 

-Dnfe.i < -Dnfc ^ -Dnfe_i < 7^ = ] W.p.l. 



Or conversely, 

/C/Ci 
7t log (a 

Arguing similarly to Step SI, we infer that for all /c > > K > 1: 

< e^/('-^^ - w.p.l. 
It log (a) 

Thus, for any e > 0, we can find K' > 1 such that for all k > K': 



Bnk-i > rTTTTT ^.p.l ^ > 5„ W.p.l. 



An, < eSn, + e < 6(6^^/(1-^^) - + 1) w.p.l. 

V 7*log(a) / 



It log(a) 
We conclude that 

lim A({no}^=o) = 0- w.p.l. 



□ 
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Returning to the main proof, we use the tilde notation to indicate asynchronous operations to 
differentiate with our synchronous operations in Step SI. We will also assume that L„ = 1 for all 
n to simplify the following notations. The proof for general L„ > 1 is exactly the same. We define 
the following (asynchronous) mappings r„ : B{Sn) — )• B{Sn) as the restricted mappings of Tn on 
Dn, a non-empty random subset of 5„, such that for all J E B{Sn)- 

fnJiz)=mm\Gniz,v) + a^'-Ep„[j{y)\z,v]}, Vz G L»„ C 5„, (25) 

fnJ{z) = J{z), yz£Sn\Dn. (26) 

We require that 

n^^,ur=nDk = s. (27) 

In other words, every state in S are sampled infinitely often. We can see that in Algorithm [T| if 
the set .^update is assigned to Dn in every iteration (Line 13), the sequence {Dn}^=i has the above 
property, and \Dn\ = Q{\Snf) < \Sn\- 

Starting from any Jq G B{So), we perform the following asynchronous iteration 

Jn+l=fn+lJn, Vn > 0. (28) 

Consider the following sequence {w-^j^Q such that mo = and for all A; > 0, from to 
mfc+i — 1, all states in 5^^+1-1 are chosen to be updated at least once, and a subset of states 
in Smk+-i-i is chosen to be updated exactly once. We observe that as the size of S'„ increases 
linearly with n, if we schedule states in Dn C Sn to be updated in a round-robin manner, we 
have k < Sm^. < k^^^ . When Dn is chosen as shown in Algorithm 111 with high probability, 
k < Sm^. < k^^^ . However, we will assume that the event k < Sm^ ^ /c"'^'^ happens surely because 
we can always schedule a fraction of D„ to be updated in a round-robin manner. 

We define Wn as the set of increasing sub-sequences of the sequence {0, 1, n} such that each 
sub-sequence contains {mj}j^Q where nik < n < m^+i: 

Wn = [{ij}J=o I {mj}'^=o C {ij}J=o ^ {0, 1, n} AT>2Amk<n< mfc+i}. 
Clearly, if {ij}J^Q € Wn, we have io = 0. For each {ij}J^Q G Wn, we define 



oo 



-I- -I- (tA**t II/*—/* II 
-|- ... -|- u; ^ W'Jij, "^iT-illoo- 



We will prove by induction that 



yzeDn^\Jn{z)-J:{z)\< max A {{ij }J^q) . (29) 

When n = 1, the only sub-sequence is {ij}J^Q = {0, 1} G Wi. It is clear that for z G Di, due to 
the contraction property of Ti: 



\JUz) - Ji{z)\ < ^ max^^^ A{{ij}J^o) = J* - JqHoo- 



Assuming that Eq. 29 holds upto n = ruk, we need to prove that the equation also holds for those 
nG (mfc, mfc-|_i) and n = mfc+i- Indeed, let us assume that Eq. 29 holds for some n G [mfc,mfc_(.i — 1). 
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Denote < n as the index of the most recent update of z. For z G we compute new values 
for z in Jn+i, and by the contraction property of T'ra+i, it follows that 



|j„+i(z)-j:+i(z)i<a^'"+^iij:+i-j, 



^Ai„+i I J*^^(z) - J„(z) 



= a^*-+i max | J:+i(z) - J„,(z)| 

<a^*"+i max (| J*^(z) - J„^(z)| + || J*+i - J^J^) 



< max a^*"+^ max ^({^,}J=o) + " 



I T 

max " ' ' " ' 



The last equality is due to n + 1 < m^+i — 1, and {mjj^^Q C {{zj}J^g, n + 1} C {0, 1, n + 1} for 



any {ijjj^g G Wn^. Therefore, Eq. 29 holds for all n G {mk,mk+i — 1]. When n = m^+i — 1, we 



also have the above relation for all z G Dn+i: 



I J„+i(z) - J:+i(z)| < max a^*"+i max ^({i,}J=o) + " 



max 



^({b}J=o)- 

1 



The last equality is due to n + 1 = m/j+i and thus {jn-jlj^o {{*j}j=0' ''^ + 1} {0, 1, n + 1} for 

also holds for n = mk+i and this completes the induction. 



29 
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any G Wn^- Therefore, Eq. 

We see that all {ij}J^Q G Wn, we have j < ij < mj, and thus j < Si^ < j^l^ ■ By Lemma 

lim ^({L}f^o £ ^n) = w.p.l. 

Therefore, 

lim sup I — J*(2;)| = w.p.l. 

Since all states are updated infinitely often, and J* converges uniformly to J* with probability one, 
we conlude that: limn_>.oo \\Jn — •^nlloo = w.p.l. and lim^^oo \\Jn — </*||oo = w.p.l. 

In both Steps SI and S2, we have lim„_!.oo I |<^n — <^nl loo = w.p.lj^ therefore //„ converges to /x* 
pointwise w.p.l as /i„ and /i* are induced from Bellman updates based on J„ and J* respectively. 
Hence, the sequence of policies {/i„}^g has each policy 

\in S-S an 6j2"Optimal policy for the MDP 
M-n such that lim„_5.oo fin = 0. By Theorem [2j we conclude that 

lim \ Jn,ii.SA - >^*(^)l =0) £ Sn w.p.l. 

n— >oo 

□ 

E Proof of Theorem [8] 

We fix an initial starting state x(0) = z. In Theorem [7j starting from an initial state x(0) = z, 
we construct a sequence of Markov chains G N}^]^ under minimizing control sequences 



^The tilde notion is dropped at this point. 
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{m";z G N}J^]^. By convention, we denote the associated interpolated continuous time trajectories 
and control processes as {C{t);t G R}n=i and G ^}n=i repsectively. By Theorem [l| 

{C"'{t);t G converges in distribution to an optimal trajectory {x*(t);t G M} under an optimal 

control process {u*{t);t G M} with probability one. In other words, {(.^ {■) , {•)) — >■ {x* {■) , u* {■)) 
w.p.l. We will show that this result can hold even when the Bellman equation is not solved exactly 
at each iteration. 

In this theorem, we solve the Bellman equation (Eq. [9| by sampling uniformly in U to form a 
control set Un such that lim„_!.oo \ Un\ =00. Let us denote the resulting Markov chains and control 
sequences due to this modification as {^j ; i G N}^^ and {uf; i G N},^]^ with associated continuous 
time interpolations {t);t G and {n"(t);t G M}^=i. In this case, randomness is due to 

both state and control sampling. We will prove that there exists minimizing control sequences 
{u"; i G N},^]^ and the induced sequence of Markov chains {^f; i G N}^;^ in Theorem [7] such that 

(f (.)_^n(.),^n(.)_^"(.))4(0,0), (30) 



where (0, 0) denotes a pair of zero processes. To prove Eq. 30, we first prove the following lemmas. 
In the following analysis, we assume that the Bellman update (Eq.joj) has minima in a neighborhood 
of positive Lebesgue measure. We also assume additional continuity of cost functions for discrete 
MDPs. 

Lemma 11 Let us consider the sequence of approximating MDPs {Ain}'^=o- For each n and a 
state z G Sn, let t>* he an optimal control minimizing the Bellman update, which is refered to as an 
optimal control from z: 

vl G V* = ar(7mm^gf;{G„(z,i;) +a^*"(^)Ep„ [J„_i(y)|z, i;]}, 
J„(z,<) = Jl{z) = Gn{z,vl) + a'^*"(^)Ep„ [J„_i(y)|z,<] , V< G K*- 

Let Vn he the best control in a sampled control set Un from z: 

Vn = argmin^^u^{Gn{z,v) + a'^'^^'^^Ep,^ [J„_i(y)|z, w]}, 

Jn{z,Vn) = G„(z,Vn) +a'^*"^^^Ep„ [Jn-l{y)\z ,Vn] ■ 

Then, when lim„_s.oo \Un\ = 00, we have | Jn(-z, — J*(z)| A as n approaches 00, and there exists 
a sequence {w* | f* G V*}^^q such that \ \vn — 'Wnlb ^ 0. 

Proof We assume that for any e > 0, the set A" = {v £ U\ \ Jn{z,v) — Jn{z)\ < e} has positive 
Lebesgue measure. That is, m{A'^) > for all e > where m is Lebesgue measure assigned to U. 
For any e > 0, we have: 

mJniz,Vn) - rniz)\ > e}) = (l - m{A^) / m{U)f"^ . 
Since 1 — m{Al^)/m(U) G [0, 1) and lim„_!.oo \Un\ = 00, we infer that: 

lim F{{\Jn{z,Vn)-rn{z)\>e})=0. 

Hence, we conclude that \ Jn{z,Vn) — Jn{z)\ A as n — 00. Under the mild assumption that 
Jniz, v) is continuous on U for all z G 5„, thus there exists a sequence {v* | G V*}^^^ such that 
\\vn — "t^ralb A as n approaches 00. □ 
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Figure 6: An illustration for Lemma 12 We have converges in probability to . From ^g , the 
optimal control is f* that results in the next random state ^f. From ^g, the optimal control and 
the best sampled control are Vn and Vn respectively. The next random state from ^g due to the 
control is . 



By Lemma 11 we conclude that || Jn — J^||oo converges to in probability. Thus, J™ returned from 
the iMDP algorithm when the Bellman update is solved via sampling converges uniformly to J* in 
probability. We, however, claim that Jn,ii„ still converges pointwise to J* almost surely in the next 
discussion. 



Lemma 12 With the notations in Lemma 
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consider two states and^ such that ||Co~Coll2 
as n approaches oo. Let ^'l he the next random state of under the best sampled control Vn from 
^g. Then, there exists a sequence of optimal controls from ^g such that \\vn — v^\\2 — >■ and 
lICi ~ crib A as n approaches oo, where is the next random state of under the optimal 
control from ^g . 



Proof We have Vn as the best sampled control from ^g. By Lemma 11, there exists a sequence of 



'0- 



optimal controls u„ from such that ||u„ — f^lb — ^ 0. We assume that the mapping from state 
space Sn, which is endowed with the usual Euclidean metric, to optimal controls in U is continuous. 
lICo "Colb 0; there exists a sequence of optimal controls from Q such that ||t'n — ^^nlb 0. 

\ \ P \ \ P \ \ P I I 

Now, \\vn — Vn\\2 " ^ and \\vn — t'nib lead to \ \vn — "yj^lb — ^ as n — )• oo. Figure [6| illustrates 
how Vn,Vn, and f* relate and 

Using the probability transition P„ of the MDP that is locally consistent with the original 
continuous system, we have: 

mi I eo,«o = <] = Co + /(Co",<)At„(Co") + o(At„(Cg")), 

\C^,U^ = Vn] = Co + f(^,Vn)Atn(i^) + o{Atn(C^)), 

covic"^ I ^^,u^ = <] = F(eg^<)F(eg^<)^A^„(eo") + o(At„(Co«)), 

Cov[r, I eo,n^ = Vn] = F(?o), t;„)F(eo),^J„)^At„(eo)) + o(At„(?o))), 

where /(•,•) is the nominal dynamics, and F{-, ■)F{-, ■)'^ is the diffusion of the original system 
that are assumed to be continuous almost everywhere. We note that At„(^g) = At„(^Q) = 

7t(log(|5„,|)/|5',i|)^''^^'^'' as ^g and ^g are updated at the n*'* iteration in this context, and the hold- 
ing times converge to as n approaches infinity. Therefore, when | |Co ~ Co I b ^ 0) I l^n — ^^nl b 0, 
we have: 

nTi- er \^oTo,u^ = v:,u^ = 4 o, (31) 

CoviTi - Cr I Co", Co , = = Vn) 4 0. (32) 
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Since ^ and are bounded, the random vector — | Co'^O'^o 



■n 



Vn] and random 



matrix Cov{(,-^ - | Co.^o>^o 

d 



Vn) are bounded. We recall that if Yn — )• 0, and hence 



Yn ^ 0, when Yn is bounded for all n, lim„_i.oo IE[^n] = and lim„_5.oo C'oT;(y„) = 0. Therefore, 
Eqs. [3TH32] imply: 



hm E 

n— >oo 



hm cov(E[e"-er i cs^ilu^ 



hm E 

n— >oo 



= 0, (33) 

V*^,U^ = Vn])=0, (34) 

'coviTi - er I Co", To, u^ = v:,ui^ = vn)] = o. (35) 

The above outer expectations and covariance are with resepect to the randomness of states , 
and sampled controls C/„. Using the iterated expectation law for Eq. 33, we obtain: 

hm E[e?-er] = o. 

n— ^oo 

Using the law of total covariance for Eqs. [34||35[ we have: 

lim Coi;[Ci -Cr] = 0- 

Since 

mTi - ml] = mTi - c^ fcCi - m = wnTi - cmii + tricov[r^ - m, 



the above limits together imply: 



lim E[|iei -erii2] = o. 



In other words, converges in 2 -mean to i^", which leads to ||^;^ — ^flb 
□ 



as n approaches oo. 



Returning to the proof of Eq. 30 we know that Co ~ ^0 = z as the starting state. From any y ^ Sn, 
an optimal control from y is denoted as v*{y), and the best sampled control from the same state y 
is denoted as v{y). 



By Lemma 12, as Uq = v{^q), there exists Uq = v*{(,q) such that \\uq 



u^\\2 4 and lie" 



1 1 

?i II2 



0. Let us assume that 



-ilbillCfc ~ Cfclb) converges in probability to (0,0) 



upto index k. We have = i'(Cfc)- Using Lemma 12, there exists = v*{(,J^) such that 



""fcl bi I iCfc+i ~ Cfc+ilb) 4 (0,0). Thus, for any i > 1, we can construct a minimizing control n" 



in Theorem [t] such that ( 1 1 C" 
immediately: 



Cll2,ll< 



<||2) A (0,0) as n 



00. Hence, Eq. 
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follows 



(r(-)-r(-),^"(-)-^"(-))4(o,o). 



We have (C^ {■) , u"' {■)) {^*{')} ^*(")) w.p.l. Thus, by hierarchical convergence of random variables 
[50] , we achieve 

(f(.),^n(.))4(x*(-),^*(-)) W.p.L 

Therefore, for all z £ Sn- 

lim \ Jn,ti„iz) - J*iz)\ = w.p.l. 



□ 
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F Proof of Theorem [9] 

Fix n E N, for all z G S, and y„ = a.rgmm^,^gj\z' — z\\2, we have 

We assume that optimal policies of the original continuous problem are obtainable. By Theorems [Tj- 
[8| we have: 

lim \Jn,^,„iyn) - J*{yn))\ = w.p.l. 

n— >oo 

Thus, fJ-niVn) converges to fJ-*{yn) almost surely where fi* is an optimal policy of the original 
continuous problem. Thus, for all e > 0, there exists such that for all n > N: 

WfJ-niVn) - f^*{yn)\\2 < | W.p.l. 

Under the assumption that fi* is continuous at z, and due to lim„_!.oo Vn = z almost surely, we can 
choose N large enough such that for all n > N: 

\\fj'*{yn) - ^J'*{z)\\2 < I w.p.l. 

From the above inequalities: 

llAin(yn) - fJ'*{z)\\2 < llAiri(yn) " ^*(yn)||2 + ||^*(yn) " fJ-* {z)\\2 < £, Vn > W.p.l. 

Therefore, 

lim ||7Z„(z) - /i*(z)||2 = lim ||/i„(y„) - ^*(z)||2 = w.p.l. 

□ 
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